In professional sports, there is huge interest in leveraging historical statistics both to predict future outcomes (wins/losses) and to explore opportunities for tuning or improving a team's or individual's performance. This data-driven approach to sports has gained a large following over the last decade and entered mass media in the form of fantasy leagues, movies (e.g. Moneyball), and websites/podcasts (e.g. FiveThirtyEight). In this analysis, we will use a classic baseball data set with the goal of building several models capable of predicting a team's wins over a season given other team statistics from that season (e.g. home runs, strikeouts, base hits).
We will first explore the data looking for issues or challenges (e.g. missing data, outliers, possible coding errors, multicollinearity). Once we have a handle on the data, we will apply any necessary cleaning steps. With a reasonable dataset in hand, we will build and evaluate three different linear models that predict seasonal wins. Our dataset includes both training data and evaluation data: we will train on the training data, then judge the models by how well they perform against the holdout evaluation data. Finally, we will select a final model that offers the best balance of accuracy and simplicity.
The moneyball training set contains 2,276 rows and 17 columns, one of which is our target variable, TARGET_WINS. It covers baseball team performance statistics from 1871 through 2006, adjusted to match the performance of a typical 162-game season. The data set is entirely numerical and contains no categorical variables.
Below is a chart describing each variable in the dataset and its theoretical effect on the number of wins projected for a team.
Variables of Interest
Given that the INDEX column has no bearing on the target variable (number of wins), it was dropped.
We compiled summary statistics on our data set to better understand the data before modeling.
## TARGET_WINS BATTING_H BATTING_2B BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## BATTING_HR BATTING_BB BATTING_SO BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
The first observation is the prevalence of NA's throughout the dataset. On a cursory view, we can quickly see that the average wins per season is about 81, exactly half of the 162 games in a typical MLB season. Additionally, spread over a 162-game season, teams average about 9 base hits per game at the plate, while their pitchers issue roughly 3.4 walks and record about 5 strikeouts per game.
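As a quick sanity check on those per-game figures, here is a small calculation using the season-level means from the summary output above and an assumed 162-game season:

```r
# Per-game rates implied by the season-level means reported above
# (values copied from the summary output; 162-game season assumed)
games <- 162
season_means <- c(BATTING_H = 1469, PITCHING_BB = 553, PITCHING_SO = 817.7)
round(season_means / games, 1)
##   BATTING_H PITCHING_BB PITCHING_SO
##         9.1         3.4         5.0
```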
Next, we wanted to get an idea of the distribution profiles for each of the variables.
The distribution profiles show the prevalence of skewness, specifically right skew in BASERUN_CS, BASERUN_SB, FIELDING_E, PITCHING_BB, PITCHING_H and PITCHING_SO. These deviations from a traditional normal distribution can be problematic for linear regression assumptions, so we may need to transform the data. Furthermore, BATTING_HBP, BATTING_HR, PITCHING_HR and BATTING_SO appear bimodal. Bimodal features are both problematic and interesting, and potentially an area of opportunity and exploration: they suggest there are possibly two different groups or classes of baseball season data, with teams tending to score higher or lower on the bimodal feature in those seasons. Two possibilities immediately come to mind. First, the bimodality could be caused by a rule change, such that games before a specific point in time had lower values and games after that point had higher values. Unfortunately, we have no feature indicating which year each row is associated with. Second, teams may simply either do well or poorly on a given KPI. Consider BATTING_SO (batters striking out): if some teams have better players, they might follow a reasonably normal distribution centered lower than teams with worse players. We know that certain teams are able to attract and pay for better players (call them Tier 1) while others have lower budgets and less visibility (call them Tier 2), and even a given team can have "good years" and "off years". It's possible the distribution of BATTING_SO represents the overlap, or superposition, of these two distinct curves representing different classes of team.
While we don’t tackle feature engineering in this analysis, if we were performing a more in-depth analysis, these are some possible options.
We have no data linking rows with specific years. We might attempt to locate additional data sets that link performance to year and leverage change point detection to see if bimodal relationships are linked with all teams at specific points in time. If change points are present, that suggests something changed affecting all teams (e.g. new rules that improved or lowered the bimodal feature). We would then create a new categorical feature indicating which class the row data came from, before or after the change point.
We have no data linking rows with specific teams, so we cannot directly tell whether there is more than one tier of team. Instead, we could explore whether the bimodality is correlated across different bimodal features within rows (i.e. do rows with lower BATTING_HR line up with rows with lower or higher BATTING_SO?). Strong correlations between bimodal features would suggest there might be two classes of teams (or of yearly team performance), and we could create new categorical variable(s) indicating the probable tier of each row. Alternatively, we could add a numerical feature giving the probability that a row belongs to each class.
R provides a package, mixtools (see R Vignette) which helps regress mixed models where data can be subdivided into subgroups. Here is a quick example showing a possible mix within BATTING_SO:
## number of iterations= 48
We elected to use box-plots to get an idea of the spread of each variable.
The box-plots reveal significant outliers and uniform (zero-only) values for many of our predictor variables. Outliers will need to be imputed where necessary, and sparse variables may need to be dropped.
Finally, we wanted to plot scatter plots of each variable versus the target variable, TARGET_WINS, to get an idea of the relationship between them.
The plots indicate some clear relationships; for example, hitting more doubles or more home runs is clearly associated with more wins.
Overall, our collection of plots also reveals significant issues with the data. Most of the predictor variables are skewed or non-normally distributed and will need to be transformed. Many rows contain missing data that will need to be either imputed or discarded. Finally, it appears some missing data is encoded as 0, and there are nonsensical outliers.
There is a team with 0 wins in the dataset. This seems unlikely. Many of the hitting categories include teams at 0; it is unlikely that a team hit 0 home runs over the course of a season.
The pitching variables also include many 0's; for instance, there are multiple teams with 0 strikeouts by their pitchers over the season, which is extremely unlikely. The pitching data also includes strange outliers, such as a team logging over 19,000 strikeouts, roughly 119 per game, which is impossible given that a regulation nine-inning game offers a pitching staff at most 27 outs. Team pitching walks and team pitching hits have strange outliers as well.
Lastly, the error variable makes little sense. From experience watching baseball, teams usually commit 2 or fewer errors per game, which would put a team's season total at approximately 320; that does not match the scale of the error variable, whose maximum is 1,898.
When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.
## values ind
## 1 91.61 BATTING_HBP
## 2 33.92 BASERUN_CS
## 3 12.57 FIELDING_DP
## 4 5.76 BASERUN_SB
## 5 4.48 BATTING_SO
## 6 4.48 PITCHING_SO
## 7 0.00 TARGET_WINS
## 8 0.00 BATTING_H
## 9 0.00 BATTING_2B
## 10 0.00 BATTING_3B
## 11 0.00 BATTING_HR
## 12 0.00 BATTING_BB
## 13 0.00 PITCHING_H
## 14 0.00 PITCHING_HR
## 15 0.00 PITCHING_BB
## 16 0.00 FIELDING_E
Notice that ~91.6% of the rows are missing the BATTING_HBP field, so we will simply drop this column from consideration. The columns BASERUN_CS (caught stealing) and BASERUN_SB (stolen bases) both have missing values. According to baseball history, stolen bases weren't tracked officially until 1887, so some of the missing data could be from the 1871-1886 seasons; we will impute those values. A high percentage of BATTING_SO (batter strikeouts) and PITCHING_SO (pitcher strikeouts) values are also missing, which seems highly unlikely, so we will impute those as well.
Building off the scatter plots, we wanted to quantify the correlations between our target variable and the predictor variables. We will want to choose those with stronger positive or negative correlations; features with correlations close to zero will probably not provide meaningful information for explaining a team's wins.
## values ind
## 1 0.383313355 BATTING_H
## 2 0.285964582 BATTING_2B
## 3 0.225471523 BATTING_BB
## 4 0.186326615 PITCHING_HR
## 5 0.173551988 BATTING_HR
## 6 0.164738136 PITCHING_BB
## 7 0.139073830 BATTING_3B
## 8 0.121110012 BASERUN_SB
## 9 0.105034615 PITCHING_H
## 10 0.009835166 BASERUN_CS
## 11 -0.030050754 FIELDING_DP
## 12 -0.037712096 BATTING_SO
## 13 -0.083637577 FIELDING_E
## 14 -0.089519609 PITCHING_SO
BATTING_H and BATTING_2B have the highest (positive) correlation with TARGET_WINS; this makes sense given that more hits mean more runs, and thus a higher likelihood of winning. The other variables have weak or slightly negative correlations, which can imply they have little predictive power.
One problem that can occur with multi-variable regression is correlation between predictor variables, or multicollinearity. A quick check is to run correlations between the variables.
When we start considering features for our models, we’ll need to account for the correlations between features and avoid including pairs with strong correlations.
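Beyond eyeballing a correlation matrix, variance inflation factors (VIFs) quantify how much collinearity inflates each coefficient's variance: VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors, and values above roughly 5-10 are commonly treated as warning signs. The sketch below illustrates the idea on simulated data (not the moneyball set); in practice we could call car::vif() on a fitted model.

```r
# VIFs computed from the inverse of the predictor correlation matrix,
# demonstrated on simulated data with one deliberately collinear pair
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(n)                 # independent of the others
X  <- cbind(x1, x2, x3)
round(diag(solve(cor(X))), 1)  # x1 and x2 inflate sharply; x3 stays near 1
```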
We removed the BATTING_HBP field as it was missing >90% of the data and the INDEX field as it offers no information for a model.
Missing values found in BASERUN_CS (caught stealing), FIELDING_DP (double plays), BASERUN_SB (stolen bases), BATTING_SO (batter strikeouts), and PITCHING_SO (pitcher strikeouts) were all replaced with median values; it is highly unlikely that teams recorded none of these during an entire season.
There are unreasonable outliers in PITCHING_SO (pitcher strikeouts), PITCHING_H (hits allowed), PITCHING_BB (walks), and FIELDING_E (fielding errors) that exceed what is reasonable or possible given standard game length. While specific games might produce outliers (e.g. games with extra innings), we wouldn't expect season totals built from such extremes in every game. Given this, we will replace any outliers with the median for the data set. The limits we set were: PITCHING_SO > 4000 (25 strikeouts per game), PITCHING_H > 5000 (30 hits allowed per game), PITCHING_BB > 2000 (about 13 walks per game), and FIELDING_E > 480 (about 3 errors per game).
From our plots above, we can see that some of our variables are highly skewed. To address this, we decided to perform some transformations to make them more normally distributed. Here are some plots to demonstrate the changes in distributions before and after the transformations:
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.246 -8.040 0.149 8.459 64.706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8791382 5.5609672 3.575 0.000358 ***
## BATTING_H 0.0435068 0.0037271 11.673 < 2e-16 ***
## BATTING_2B -0.0181980 0.0091059 -1.998 0.045785 *
## BATTING_3B 0.0887800 0.0166055 5.346 9.87e-08 ***
## BATTING_HR -0.0498772 0.0307896 -1.620 0.105385
## BATTING_BB 0.0668013 0.0051197 13.048 < 2e-16 ***
## BATTING_SO 0.0043819 0.0039536 1.108 0.267841
## BASERUN_SB 0.0199677 0.0041255 4.840 1.39e-06 ***
## BASERUN_CS 0.0054960 0.0154463 0.356 0.722013
## PITCHING_H 0.0031034 0.0009168 3.385 0.000723 ***
## PITCHING_HR 0.0875913 0.0272043 3.220 0.001301 **
## PITCHING_BB -0.0343166 0.0043823 -7.831 7.39e-15 ***
## PITCHING_SO -0.0059484 0.0028921 -2.057 0.039826 *
## FIELDING_E -0.0431197 0.0051761 -8.331 < 2e-16 ***
## FIELDING_DP -0.1453152 0.0132805 -10.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.01 on 2260 degrees of freedom
## Multiple R-squared: 0.3141, Adjusted R-squared: 0.3099
## F-statistic: 73.93 on 14 and 2260 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 8.974002397 30.7842739439
## BATTING_H 0.036197876 0.0508157046
## BATTING_2B -0.036054813 -0.0003411237
## BATTING_3B 0.056216362 0.1213435818
## BATTING_HR -0.110256003 0.0105016227
## BATTING_BB 0.056761459 0.0768412290
## BATTING_SO -0.003371201 0.0121349175
## BASERUN_SB 0.011877551 0.0280577880
## BASERUN_CS -0.024794453 0.0357865111
## PITCHING_H 0.001305648 0.0049012431
## PITCHING_HR 0.034243211 0.1409393798
## PITCHING_BB -0.042910318 -0.0257228679
## PITCHING_SO -0.011619907 -0.0002768375
## FIELDING_E -0.053270144 -0.0329693449
## FIELDING_DP -0.171358419 -0.1192719091
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B +
## BATTING_BB + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB +
## PITCHING_SO + FIELDING_E + FIELDING_DP, data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.051 -7.965 0.164 8.359 65.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.7189057 4.9999664 4.344 1.46e-05 ***
## BATTING_H 0.0428227 0.0036457 11.746 < 2e-16 ***
## BATTING_2B -0.0172282 0.0089438 -1.926 0.054197 .
## BATTING_3B 0.0917609 0.0161863 5.669 1.62e-08 ***
## BATTING_BB 0.0643346 0.0041503 15.501 < 2e-16 ***
## BASERUN_SB 0.0209334 0.0039963 5.238 1.77e-07 ***
## PITCHING_H 0.0027985 0.0008467 3.305 0.000964 ***
## PITCHING_HR 0.0466418 0.0083051 5.616 2.19e-08 ***
## PITCHING_BB -0.0323388 0.0034282 -9.433 < 2e-16 ***
## PITCHING_SO -0.0031173 0.0016399 -1.901 0.057446 .
## FIELDING_E -0.0425346 0.0051451 -8.267 2.31e-16 ***
## FIELDING_DP -0.1466897 0.0132112 -11.103 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.01 on 2263 degrees of freedom
## Multiple R-squared: 0.3132, Adjusted R-squared: 0.3098
## F-statistic: 93.8 on 11 and 2263 DF, p-value: < 2.2e-16
We can use a simple decision tree to help visualize feature importance.
We can also explicitly call out variable importance using the caret package, which determines feature importance for each of the multiple linear regression models individually.
In our second model, we decided to utilize some of our transformed variables to compare against our initial model. As a result of the transformations, there were just a few values that needed to be imputed – we imputed using the mean value for the given variable.
Then, similar to model 1, we used Stepwise selection to determine feature importance and select the simplest model possible.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = clean_trans_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.074 -7.938 0.122 8.547 64.835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.054e+01 1.907e+01 0.553 0.581
## BATTING_H 4.316e-02 3.717e-03 11.611 < 2e-16 ***
## BATTING_2B -1.953e-02 9.047e-03 -2.159 0.031 *
## BATTING_3B_transform 6.194e+00 9.093e-01 6.812 1.23e-11 ***
## BATTING_HR_transform 3.564e-01 3.070e-01 1.161 0.246
## BATTING_BB 6.532e-02 5.689e-03 11.482 < 2e-16 ***
## BATTING_SO -4.883e-03 3.794e-03 -1.287 0.198
## BASERUN_SB_transform 2.742e+00 5.782e-01 4.743 2.24e-06 ***
## BASERUN_CS 1.887e-03 1.605e-02 0.118 0.906
## PITCHING_H 1.361e-03 9.483e-04 1.435 0.151
## PITCHING_HR 2.305e-02 1.804e-02 1.278 0.201
## PITCHING_BB_transform -1.641e+01 2.391e+00 -6.862 8.71e-12 ***
## PITCHING_SO_transform 1.444e+00 2.512e+00 0.575 0.566
## FIELDING_E_transform 2.590e+02 3.210e+01 8.070 1.13e-15 ***
## FIELDING_DP -1.438e-01 1.363e-02 -10.549 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 2260 degrees of freedom
## Multiple R-squared: 0.3121, Adjusted R-squared: 0.3079
## F-statistic: 73.25 on 14 and 2260 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) -2.685561e+01 47.933605420
## BATTING_H 3.587284e-02 0.050452044
## BATTING_2B -3.727083e-02 -0.001787488
## BATTING_3B_transform 4.411081e+00 7.977361789
## BATTING_HR_transform -2.456447e-01 0.958412393
## BATTING_BB 5.416108e-02 0.076472137
## BATTING_SO -1.232372e-02 0.002557970
## BASERUN_SB_transform 1.608397e+00 3.876237934
## BASERUN_CS -2.958376e-02 0.033358259
## PITCHING_H -4.984537e-04 0.003220807
## PITCHING_HR -1.232196e-02 0.058418908
## PITCHING_BB_transform -2.109473e+01 -11.718068356
## PITCHING_SO_transform -3.482596e+00 6.369685391
## FIELDING_E_transform 1.960650e+02 321.942566498
## FIELDING_DP -1.705245e-01 -0.117061149
##
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B_transform +
## BATTING_BB + BASERUN_SB_transform + PITCHING_H + PITCHING_HR +
## PITCHING_BB_transform + FIELDING_E_transform + FIELDING_DP,
## data = clean_trans_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.957 -8.032 0.202 8.466 61.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.938e+01 1.224e+01 1.583 0.1136
## BATTING_H 4.542e-02 3.040e-03 14.941 < 2e-16 ***
## BATTING_2B -2.219e-02 8.626e-03 -2.572 0.0102 *
## BATTING_3B_transform 6.380e+00 8.992e-01 7.095 1.72e-12 ***
## BATTING_BB 6.754e-02 4.390e-03 15.384 < 2e-16 ***
## BASERUN_SB_transform 2.454e+00 5.075e-01 4.835 1.42e-06 ***
## PITCHING_H 1.567e-03 8.314e-04 1.885 0.0595 .
## PITCHING_HR 3.580e-02 7.715e-03 4.640 3.68e-06 ***
## PITCHING_BB_transform -1.709e+01 2.004e+00 -8.527 < 2e-16 ***
## FIELDING_E_transform 2.594e+02 3.100e+01 8.369 < 2e-16 ***
## FIELDING_DP -1.386e-01 1.326e-02 -10.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 2264 degrees of freedom
## Multiple R-squared: 0.3113, Adjusted R-squared: 0.3082
## F-statistic: 102.3 on 10 and 2264 DF, p-value: < 2.2e-16
We can compare the variable importance of our model built on the original dataset (model 1) against the model built on the transformed dataset (model 2), as shown below. Notice that BATTING_BB and BATTING_H are much closer together in terms of their weight in predicting TARGET_WINS than they were in model 1 (pre-transformation).
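The model comparison behind this discussion can also be summarized numerically. A minimal sketch, assuming the fitted model objects (multi_lm, mult_lm_final, mult_lm_final_2) from the code appendix are in scope; `df_eval_clean` is a hypothetical name for the evaluation data after the same cleaning and transformation steps:

```r
# Compare candidate models on training-set accuracy vs. simplicity
# (assumes the lm objects fitted in the code appendix are available)
rmse <- function(m) sqrt(mean(residuals(m)^2))
models <- list(full = multi_lm, step = mult_lm_final, step_trans = mult_lm_final_2)
data.frame(rmse   = sapply(models, rmse),
           adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
           aic    = sapply(models, AIC))
# Score the holdout set with the chosen model:
# predict(mult_lm_final, newdata = df_eval_clean)
```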
# =====================================================================================
# Load Libraries and Define Helper functions
# =====================================================================================
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
plot(residuals(model))
hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
xlab = "Residuals",col="lightgreen")
lines(density(model[["residuals"]], kernel = "ep"),col="blue", lwd=3)
curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)
qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
par(mfrow = c(1, 1))
}
# =====================================================================================
# Load Data set
# =====================================================================================
# Load Moneyball baseball dataset
df <- read.csv('datasets/moneyball-training-data.csv')
df_eval <- read.csv('datasets/moneyball-evaluation-data.csv')
# Remove superfluous TEAM_ from column names
names(df) <- names(df) %>% str_replace_all('TEAM_', '')
# Drop the INDEX column - this won't be useful
df <- df %>% dplyr::select(-INDEX)
# =====================================================================================
# Summary Stats
# =====================================================================================
# Display summary statistics
summary(df)
# =====================================================================================
# Distributions
# =====================================================================================
# Prepare data for ggplot
gather_df <- df %>% gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~variable, scales='free', ncol=4)
# Select BATTING_SO column and remove any missing data
df_mix <- df %>%
dplyr::select(BATTING_SO) %>%
drop_na()
# Calculate mixed distributions for BATTING_SO
batting_so_mix <- normalmixEM(df_mix$BATTING_SO,
lambda = .5,
mu = c(400, 1200),
sigma = 5,
maxit=60)
# Simple plot to illustrate possible bimodal mix of groups
plot(batting_so_mix,
whichplots = 2,
density=TRUE,
main2="BATTING_SO Possible Distributions",
xlab2="BATTING_SO")
# =====================================================================================
# Boxplots
# =====================================================================================
# Prepare data for ggplot
gather_df <- df %>% gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(gather_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=6)
# =====================================================================================
# Variable Plots
# =====================================================================================
# Plot scatter plots of each variable versus the target variable
featurePlot(df[,2:ncol(df)], df[,1], pch = 20)
# =====================================================================================
# Missing Data
# =====================================================================================
# Identify missing data by Feature and display percent breakout
missing <- colSums(df %>% sapply(is.na))
missing_pct <- round(missing / nrow(df) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
# Drop the BATTING_HBP field
df <- df %>% select(-BATTING_HBP)
# We have chosen to impute the median as there are strong outliers that may skew the mean. Could revisit for advanced imputation via prediction later.
no_outlier_df <- df
# 4000 strikeouts is an average of 25 strikeouts per game, which is ridiculous.
no_outlier_df$PITCHING_SO <- ifelse(no_outlier_df$PITCHING_SO > 4000, NA, no_outlier_df$PITCHING_SO)
# 5000 hits is an average of 30 hits allowed per game, which is also ridiculous.
no_outlier_df$PITCHING_H <- ifelse(no_outlier_df$PITCHING_H > 5000, NA, no_outlier_df$PITCHING_H)
# 2000 walks is an average of 13 walks per game which is unlikely.
no_outlier_df$PITCHING_BB <- ifelse(no_outlier_df$PITCHING_BB > 2000, NA, no_outlier_df$PITCHING_BB)
# more than 480 errors is an average of 3 per game which is unlikely.
no_outlier_df$FIELDING_E <- ifelse(no_outlier_df$FIELDING_E > 480, NA, no_outlier_df$FIELDING_E)
clean_df <- no_outlier_df %>% impute(what = 'median') %>% as.data.frame()
# From the plots, the team with 0 wins has 0 in multiple other categories. I believe this is missing data, not valid data.
clean_df <- clean_df %>% filter(TARGET_WINS != 0)
gather_clean_df <- clean_df %>% gather(key = 'variable', value = 'value', -TARGET_WINS)
# =====================================================================================
# Feature-Target Correlations
# =====================================================================================
# Show feature correlations/target by decreasing correlation
stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
# =====================================================================================
# Multicolinearity
# =====================================================================================
correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust',
col=brewer.pal(n=8, name="RdYlBu"))
# =====================================================================================
# Transform non-normal variables
# =====================================================================================
# created empty data frame to store transformed variables
df_temp <- data.frame(matrix(ncol = 1, nrow = length(clean_df$TARGET_WINS)))
# computed the Box-Cox lambda for reference, then applied a log transformation
df_temp$BATTING_3B <- clean_df$BATTING_3B
batting3b_lambda <- BoxCox.lambda(clean_df$BATTING_3B)
df_temp$BATTING_3B_transform <- log(clean_df$BATTING_3B)
# performed boxcox transformation after identifying proper lambda
df_temp$BATTING_HR <- clean_df$BATTING_HR
battingHR_lambda <- BoxCox.lambda(clean_df$BATTING_HR)
df_temp$BATTING_HR_transform <- BoxCox(clean_df$BATTING_HR, battingHR_lambda)
# performed a log transformation
df_temp$PITCHING_BB <- clean_df$PITCHING_BB
df_temp$PITCHING_BB_transform <- log(clean_df$PITCHING_BB)
# performed a log transformation
df_temp$PITCHING_SO <- clean_df$PITCHING_SO
df_temp$PITCHING_SO_transform <- log(clean_df$PITCHING_SO)
# performed an inverse log transformation
df_temp$FIELDING_E <- clean_df$FIELDING_E
df_temp$FIELDING_E_transform <- 1/log(clean_df$FIELDING_E)
# performed a log transformation
df_temp$BASERUN_SB <- clean_df$BASERUN_SB
df_temp$BASERUN_SB_transform <- log(clean_df$BASERUN_SB)
df_temp <- df_temp[, 2:13]
histbox <- function(df, box) {
par(mfrow = box)
ndf <- dimnames(df)[[2]]
for (i in seq_along(ndf)) {
data <- na.omit(unlist(df[, i]))
hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
xlab = ndf[i], freq = FALSE)
lines(density(data, kernel = "ep"), col = 'red')
}
par(mfrow = c(1, 1))
}
histbox(df_temp, c(6, 2))
# =====================================================================================
# Model 1 (with and without stepAIC)
# =====================================================================================
# Model 1 - Include all features and leverage stepAIC to help identify ideal features
multi_lm <- lm(TARGET_WINS ~ ., clean_df)
(lm_s <- summary(multi_lm))
confint(multi_lm)
residPlot(lm_s)
mult_lm_final <- stepAIC(multi_lm, direction = "both",
scope = list(upper = multi_lm, lower = ~ 1),
scale = 0, trace = FALSE)
# Display Model 1 Summary
(lmf_s <- summary(mult_lm_final))
# Display Model 1 residual plots
residPlot(lmf_s)
# Model 1 - Simple decision tree for feature importance
dt_m <- caret::train(TARGET_WINS ~ ., data = clean_df, tuneLength = 19L, method = 'rpart2')
rpart.plot(dt_m$finalModel, type = 1L)
# Model 1 - Simple `caret` plot to show feature importance
VarImpPlot<-function(LINEARMODEL,TITLE){
varImp(LINEARMODEL) %>% as.data.frame() %>%
ggplot(aes(x = reorder(rownames(.),desc(Overall)), y = Overall))+
geom_col(aes(fill = Overall))+
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 90))+
scale_fill_gradient()+
labs(title = TITLE,
x = "Parameter",
y = "Relative Importance")
}
# =====================================================================================
# Model 2 (with and without stepAIC)
# =====================================================================================
# Model 2 - Create data frame with Transformed features
clean_trans_df <- data.frame(cbind(TARGET_WINS = clean_df$TARGET_WINS,
BATTING_H = clean_df$BATTING_H,
BATTING_2B = clean_df$BATTING_2B,
BATTING_3B_transform = df_temp$BATTING_3B_transform,
BATTING_HR_transform = df_temp$BATTING_HR_transform,
BATTING_BB = clean_df$BATTING_BB,
BATTING_SO = clean_df$BATTING_SO,
BASERUN_SB_transform = df_temp$BASERUN_SB_transform,
BASERUN_CS = clean_df$BASERUN_CS,
PITCHING_H = clean_df$PITCHING_H,
PITCHING_HR = clean_df$PITCHING_HR,
PITCHING_BB_transform = df_temp$PITCHING_BB_transform,
PITCHING_SO_transform = df_temp$PITCHING_SO_transform,
FIELDING_E_transform = df_temp$FIELDING_E_transform,
FIELDING_DP = clean_df$FIELDING_DP))
is.na(clean_trans_df) <- sapply(clean_trans_df, is.infinite)
mean = mean(clean_trans_df$BATTING_3B_transform, na.rm = TRUE)
mean2 = mean(clean_trans_df$BASERUN_SB_transform, na.rm = TRUE)
mean3 = mean(clean_trans_df$PITCHING_SO_transform, na.rm = TRUE)
clean_trans_df$BATTING_3B_transform[is.na(clean_trans_df$BATTING_3B_transform)] <- mean
clean_trans_df$BASERUN_SB_transform[is.na(clean_trans_df$BASERUN_SB_transform)] <- mean2
clean_trans_df$PITCHING_SO_transform[is.na(clean_trans_df$PITCHING_SO_transform)] <- mean3
# Model 2 - Build linear model
multi_lm_2 <- lm(TARGET_WINS ~ ., clean_trans_df)
# Model 2 - Show model summary info
(lm_s_2 <- summary(multi_lm_2))
# Model 2 - Show Confidence and Residual Plots
confint(multi_lm_2)
residPlot(lm_s_2)
# Model 2 - use stepAIC to select significant features
mult_lm_final_2 <- stepAIC(multi_lm_2, direction = "both",
scope = list(upper = multi_lm_2, lower = ~ 1),
scale = 0, trace = FALSE)
# Model 2 - Show summary of revised model with only significant features included
(lmf_s_2 <- summary(mult_lm_final_2))
# model 2 - Show residual plots
residPlot(lmf_s_2)
# Model 2 - Show feature importance in final model after stepAIC
VarImpPlot(mult_lm_final_2, "Model 2 LM Variable Importance")
# =====================================================================================
# Select Models
# =====================================================================================